You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.
A viable business model is a central concept here: it describes how the company currently does business and how those practices can be changed for the benefit of the tourism sector.
One of the ways to expand the customer base is to introduce a new offering of packages.
Currently, the company offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of the customers purchased one of these packages.
However, the marketing cost was quite high because customers were contacted at random without looking at the available information.
The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.
However, this time the company wants to harness the available data on existing and potential customers to make the marketing expenditure more efficient.
As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and Marketing Team, and also to build a model that predicts which customers are likely to purchase the newly introduced travel package.
Objective: predict which customers are more likely to purchase the newly introduced travel package.
Customer details:
- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How the customer was contacted (Company Invited or Self Inquiry)
- CityTier: City tier, which depends on the development of a city, population, facilities, and living standards. The categories are ordered, i.e. Tier 1 > Tier 2 > Tier 3
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: Whether the customer has a passport or not (0: No, 1: Yes)
- OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer

Customer interaction data:

- PitchSatisfactionScore: Sales pitch satisfaction score
- ProductPitched: Product pitched by the salesperson
- NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
- DurationOfPitch: Duration of the pitch by a salesperson to the customer

Note:
Please note XGBoost can take a significantly longer time to run, so if you have time complexity issues then you can avoid tuning XGBoost. No marks will be deducted if XGBoost tuning is not attempted.
Let's start by importing libraries we need.
import warnings
warnings.filterwarnings("ignore")

# Install xgboost if it is not already available (must run before the xgboost imports)
! pip install xgboost

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier, AdaBoostClassifier,
    GradientBoostingClassifier, StackingClassifier,
)
from sklearn.ensemble import (
    BaggingRegressor, RandomForestRegressor, AdaBoostRegressor,
    GradientBoostingRegressor, StackingRegressor,
)
from xgboost import XGBClassifier, XGBRegressor
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
%autosave 15
Requirement already satisfied: xgboost in d:\programs\anaconda3\lib\site-packages (1.5.0)
Requirement already satisfied: numpy in d:\programs\anaconda3\lib\site-packages (from xgboost) (1.19.2)
Requirement already satisfied: scipy in d:\programs\anaconda3\lib\site-packages (from xgboost) (1.5.2)
Autosaving every 15 seconds
#Loading dataset
data=pd.read_excel("Tourism.xlsx",sheet_name=1)
View the first 5 rows of the dataset.
data.head()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
Check data types and number of non-null values for each column.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CustomerID                4888 non-null   int64
 1   ProdTaken                 4888 non-null   int64
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object
 4   CityTier                  4888 non-null   int64
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object
 7   Gender                    4888 non-null   object
 8   NumberOfPersonVisiting    4888 non-null   int64
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64
 15  PitchSatisfactionScore    4888 non-null   int64
 16  OwnCar                    4888 non-null   int64
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
Check the number of missing values in each column using the isna() method.
data.isna().sum()
CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64
Summary of the dataset
# filtering object type columns
cat_columns = data.describe(include=["object"]).columns
cat_columns
Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
'MaritalStatus', 'Designation'],
dtype='object')
# filtering continuous type columns
cont_columns = data.describe(include=["float64","int64"]).columns
cont_columns
Index(['CustomerID', 'ProdTaken', 'Age', 'CityTier', 'DurationOfPitch',
'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'NumberOfChildrenVisiting', 'MonthlyIncome'],
dtype='object')
# filtering continuous columns that have missing values
data[cont_columns].isna().sum()
CustomerID                    0
ProdTaken                     0
Age                         226
CityTier                      0
DurationOfPitch             251
NumberOfPersonVisiting        0
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
MonthlyIncome               233
dtype: int64
# Summary of continuous columns
data[['ProdTaken', 'Age', 'CityTier', 'DurationOfPitch',
'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar',
'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
'NumberOfChildrenVisiting', 'MonthlyIncome']].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.0 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
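The summary above shows DurationOfPitch with a maximum of 127 against a 75th percentile of 20, and MonthlyIncome ranging from 1000 to 98678, so both columns likely contain outliers. A minimal sketch of an IQR-based (Tukey fence) check on a small synthetic series — the values below are hypothetical, not the actual column:

```python
import pandas as pd

# Hypothetical pitch durations standing in for data["DurationOfPitch"]
s = pd.Series([5, 9, 13, 20, 21, 127])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [127]
```

The same fences could be applied to each continuous column before deciding whether to cap, drop, or keep the extreme values.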
# creating a copy of the data so that original data remains unchanged
df = data.copy()
Number of unique values in each column
df.nunique()
CustomerID                  4888
ProdTaken                      2
Age                           44
TypeofContact                  2
CityTier                       3
DurationOfPitch               34
Occupation                     4
Gender                         2
NumberOfPersonVisiting         5
NumberOfFollowups              6
ProductPitched                 5
PreferredPropertyStar          3
MaritalStatus                  4
NumberOfTrips                 12
Passport                       2
PitchSatisfactionScore         5
OwnCar                         2
NumberOfChildrenVisiting       4
Designation                    5
MonthlyIncome               2475
dtype: int64
#Dropping CustomerID columns from the dataframe
df.drop(columns=['CustomerID'], inplace=True)
# let us reset the dataframe index
df.reset_index(inplace=True, drop=True)
# filtering object-type (categorical) columns that have missing values
cat_columns = df.describe(include=["object"]).columns
cat_columns
df[cat_columns].isna().sum()
TypeofContact     25
Occupation         0
Gender             0
ProductPitched     0
MaritalStatus      0
Designation        0
dtype: int64
df.TypeofContact.fillna("Others", inplace=True)
## Checking the value counts for TypeofContact
df.TypeofContact.value_counts()
Self Enquiry       3444
Company Invited    1419
Others               25
Name: TypeofContact, dtype: int64
# filtering continuous columns that have missing values
cont_columns = df.describe(include=["float64","int64"]).columns
cont_columns
df[cont_columns].isna().sum()
ProdTaken                     0
Age                         226
CityTier                      0
DurationOfPitch             251
NumberOfPersonVisiting        0
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
MonthlyIncome               233
dtype: int64
## Checking the summary statistics of all continuous variables before imputing
df[[ 'Age', 'DurationOfPitch','NumberOfFollowups', 'PreferredPropertyStar',
'NumberOfTrips','NumberOfChildrenVisiting', 'MonthlyIncome']].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.0 | 36.0 | 44.0 | 61.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.0 | 13.0 | 20.0 | 127.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.0 | 4.0 | 4.0 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.0 | 3.0 | 4.0 | 22.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.0 | 22347.0 | 25571.0 | 98678.0 |
Number of observations in each category
## Create a copy of the DF before imputing values
df1=df.copy()
df1[['Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar',
     'NumberOfTrips', 'NumberOfChildrenVisiting', 'MonthlyIncome']] = df[
    ['Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar',
     'NumberOfTrips', 'NumberOfChildrenVisiting', 'MonthlyIncome']
].transform(lambda x: x.fillna(x.median()))
df1.isnull().sum()
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64
df1.isnull().values.any() # If there are any null values in data set
False
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # For histogram
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4888 non-null   float64
 2   TypeofContact             4888 non-null   object
 3   CityTier                  4888 non-null   int64
 4   DurationOfPitch           4888 non-null   float64
 5   Occupation                4888 non-null   object
 6   Gender                    4888 non-null   object
 7   NumberOfPersonVisiting    4888 non-null   int64
 8   NumberOfFollowups         4888 non-null   float64
 9   ProductPitched            4888 non-null   object
 10  PreferredPropertyStar     4888 non-null   float64
 11  MaritalStatus             4888 non-null   object
 12  NumberOfTrips             4888 non-null   float64
 13  Passport                  4888 non-null   int64
 14  PitchSatisfactionScore    4888 non-null   int64
 15  OwnCar                    4888 non-null   int64
 16  NumberOfChildrenVisiting  4888 non-null   float64
 17  Designation               4888 non-null   object
 18  MonthlyIncome             4888 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 725.7+ KB
histogram_boxplot(df1, "Age", bins=100)
histogram_boxplot(df1, "CityTier", bins=100)
histogram_boxplot(df1, "DurationOfPitch", bins=100)
## applying a log10 transformation to create lg_DurationOfPitch and correct the right skew in the distribution
df1['lg_DurationOfPitch']=np.log10(df1['DurationOfPitch'])
histogram_boxplot(df1, "lg_DurationOfPitch", bins=100)
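To quantify how much the log transform helps, skewness can be compared before and after. A small sketch on synthetic right-skewed data — a lognormal sample standing in for DurationOfPitch, not the notebook's column:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; df1["DurationOfPitch"] would be used in practice
rng = np.random.default_rng(1)
s = pd.Series(rng.lognormal(mean=2.5, sigma=0.5, size=1000))

print(round(s.skew(), 2))            # clearly positive: right-skewed
print(round(np.log10(s).skew(), 2))  # close to 0: roughly symmetric
```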
histogram_boxplot(df1, "NumberOfPersonVisiting", bins=100)
histogram_boxplot(df1, "NumberOfFollowups", bins=200)
histogram_boxplot(df1, "PreferredPropertyStar", bins=200)
histogram_boxplot(df1, "NumberOfTrips", bins=200)
histogram_boxplot(df1, "NumberOfChildrenVisiting", bins=200)
# Top 5 highest target variable values
data['ProdTaken'].nlargest()
0     1
2     1
14    1
21    1
24    1
Name: ProdTaken, dtype: int64
## Alternative technique: observing all the distributions at once
columns = list(df1)[0:-1]  # excluding the last column
df1[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2));
# value counts of each categorical column
cat_columns = df1.describe(include=["object"]).columns
for column in cat_columns:
print(df1[column].value_counts())
print('-'*30)
Self Enquiry       3444
Company Invited    1419
Others               25
Name: TypeofContact, dtype: int64
------------------------------
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: Occupation, dtype: int64
------------------------------
Male      2916
Female    1972
Name: Gender, dtype: int64
------------------------------
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: ProductPitched, dtype: int64
------------------------------
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: MaritalStatus, dtype: int64
------------------------------
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: Designation, dtype: int64
------------------------------
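The counts for ProductPitched and Designation are identical (1842/1732/742/342/230), which suggests each designation is pitched exactly one product. pd.crosstab can verify such a one-to-one mapping; a toy sketch below — the real call would be pd.crosstab(df1["Designation"], df1["ProductPitched"]):

```python
import pandas as pd

# Toy frame illustrating the check (hypothetical rows, not the project data)
toy = pd.DataFrame({
    "Designation":    ["Executive", "Executive", "Manager", "Manager"],
    "ProductPitched": ["Basic", "Basic", "Deluxe", "Deluxe"],
})
ct = pd.crosstab(toy["Designation"], toy["ProductPitched"])
# one-to-one mapping <=> every row has exactly one non-zero cell
print((ct.gt(0).sum(axis=1) == 1).all())  # -> True
```

If the mapping holds on the real data, one of the two columns is redundant for modeling.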
## Adding a boolean copy of the target column
df1['ProdTaken_bol'] = df1['ProdTaken'].astype('bool')
df1.sample(10)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | ... | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | lg_DurationOfPitch | ProdTaken_bol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2541 | 0 | 29.0 | Company Invited | 3 | 11.0 | Small Business | Male | 3 | 4.0 | Deluxe | ... | Divorced | 3.0 | 0 | 2 | 1 | 1.0 | Manager | 22899.0 | 1.041393 | False |
| 387 | 0 | 40.0 | Self Enquiry | 3 | 8.0 | Small Business | Female | 2 | 3.0 | Deluxe | ... | Married | 1.0 | 0 | 2 | 0 | 1.0 | Manager | 20715.0 | 0.903090 | False |
| 1227 | 0 | 36.0 | Self Enquiry | 1 | 8.0 | Salaried | Male | 3 | 3.0 | Basic | ... | Married | 2.0 | 0 | 5 | 0 | 2.0 | Executive | 18477.0 | 0.903090 | False |
| 3607 | 0 | 38.0 | Self Enquiry | 1 | 17.0 | Small Business | Female | 4 | 4.0 | Basic | ... | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 22614.0 | 1.230449 | False |
| 702 | 0 | 30.0 | Self Enquiry | 3 | 14.0 | Salaried | Male | 3 | 3.0 | Standard | ... | Married | 6.0 | 0 | 3 | 1 | 0.0 | Senior Manager | 22264.0 | 1.146128 | False |
| 1063 | 0 | 29.0 | Self Enquiry | 3 | 25.0 | Salaried | Male | 3 | 4.0 | Deluxe | ... | Married | 2.0 | 0 | 4 | 1 | 0.0 | Manager | 23620.0 | 1.397940 | False |
| 3283 | 0 | 43.0 | Self Enquiry | 3 | 11.0 | Small Business | Male | 3 | 4.0 | Deluxe | ... | Unmarried | 2.0 | 0 | 5 | 1 | 2.0 | Manager | 23833.0 | 1.041393 | False |
| 1832 | 0 | 21.0 | Company Invited | 3 | 15.0 | Small Business | Male | 2 | 3.0 | Basic | ... | Single | 2.0 | 0 | 4 | 1 | 0.0 | Executive | 17610.0 | 1.176091 | False |
| 1777 | 0 | 38.0 | Self Enquiry | 1 | 31.0 | Salaried | Female | 2 | 4.0 | Standard | ... | Married | 4.0 | 0 | 3 | 0 | 1.0 | Senior Manager | 27061.0 | 1.491362 | False |
| 1035 | 0 | 31.0 | Self Enquiry | 2 | 14.0 | Small Business | Female | 3 | 1.0 | Basic | ... | Single | 1.0 | 0 | 1 | 0 | 2.0 | Executive | 17109.0 | 1.146128 | False |
10 rows × 21 columns
Function to create barplots that indicate percentage for each category
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(df1, "CityTier",perc=True)
labeled_barplot(df1,'Occupation',perc=True)
labeled_barplot(df1,'Gender',perc=True)
labeled_barplot(df1,'MaritalStatus',perc=True)
labeled_barplot(df1,'OwnCar',perc=True)
labeled_barplot(data,'Designation',perc=True)
plt.figure(figsize=(15, 7))
sns.heatmap(df1.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(
df1,
x_vars=["MonthlyIncome", "Age", "NumberOfTrips"],
y_vars=["ProdTaken_bol"],
height=4,
aspect=1
);
sns.pairplot(df1, hue="ProdTaken")
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
distribution_plot_wrt_target(df1, "MaritalStatus", "ProdTaken_bol")
distribution_plot_wrt_target(df1, "MonthlyIncome", "OwnCar")
distribution_plot_wrt_target(df1, "MonthlyIncome", "ProdTaken_bol")
distribution_plot_wrt_target(df1, "Designation", "ProdTaken_bol")
distribution_plot_wrt_target(df1, "ProductPitched", "ProdTaken_bol")
distribution_plot_wrt_target(df1, "PitchSatisfactionScore", "ProdTaken_bol")
sns.catplot(x="ProductPitched", y="Age", data=df1, kind='bar', height=6, aspect=1.6, estimator=np.mean);
sns.catplot(x="PitchSatisfactionScore", y="ProdTaken_bol", data=df1, kind='bar', height=6, aspect=1.6, estimator=np.mean);
Data Cleaning:
Observations from EDA:
# Dropping the helper column ProdTaken_bol and creating df2 for modeling
df2 = df1.drop(columns=['ProdTaken_bol'])
df2.sample(10)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | lg_DurationOfPitch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2981 | 0 | 28.0 | Company Invited | 1 | 17.0 | Salaried | Male | 3 | 4.0 | Standard | 5.0 | Married | 3.0 | 0 | 3 | 1 | 1.0 | Senior Manager | 27471.0 | 1.230449 |
| 1022 | 0 | 36.0 | Company Invited | 1 | 11.0 | Large Business | Male | 2 | 1.0 | Basic | 3.0 | Single | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18500.0 | 1.041393 |
| 3284 | 0 | 36.0 | Self Enquiry | 1 | 7.0 | Small Business | Male | 3 | 5.0 | Basic | 3.0 | Divorced | 8.0 | 0 | 2 | 1 | 2.0 | Executive | 20936.0 | 0.845098 |
| 2445 | 0 | 50.0 | Company Invited | 1 | 15.0 | Salaried | Male | 4 | 5.0 | Deluxe | 4.0 | Divorced | 3.0 | 0 | 3 | 1 | 3.0 | Manager | 23808.0 | 1.176091 |
| 1441 | 0 | 56.0 | Company Invited | 1 | 6.0 | Salaried | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | 21306.0 | 0.778151 |
| 176 | 0 | 33.0 | Self Enquiry | 1 | 8.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Single | 1.0 | 0 | 3 | 0 | 1.0 | Executive | 17500.0 | 0.903090 |
| 3334 | 0 | 55.0 | Company Invited | 1 | 7.0 | Salaried | Female | 3 | 4.0 | Standard | 3.0 | Married | 2.0 | 0 | 5 | 1 | 2.0 | Senior Manager | 29180.0 | 0.845098 |
| 3022 | 0 | 39.0 | Company Invited | 1 | 9.0 | Salaried | Female | 4 | 2.0 | Deluxe | 5.0 | Unmarried | 8.0 | 1 | 2 | 1 | 3.0 | Manager | 24658.0 | 0.954243 |
| 1725 | 0 | 25.0 | Self Enquiry | 1 | 13.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 1 | 1.0 | Manager | 19898.0 | 1.113943 |
| 1516 | 0 | 34.0 | Company Invited | 3 | 13.0 | Small Business | Male | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 0 | 5 | 0 | 1.0 | Manager | 19568.0 | 1.113943 |
We pass the stratify parameter with the target variable in the train_test_split function so that both splits preserve the class proportions.
# Separating features and the target column ProdTaken
X = df2.drop('ProdTaken', axis=1)
y = df2['ProdTaken']
## pd.get_dummies one-hot encodes all the categorical columns
X = pd.get_dummies(X, drop_first=True)  # drop_first reduces multicollinearity
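A quick toy illustration of what drop_first=True does (hypothetical single-column frame, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})
full = pd.get_dummies(toy)                      # Gender_Female and Gender_Male
reduced = pd.get_dummies(toy, drop_first=True)  # Gender_Male only

# The dropped level is implied: Gender_Male == 0 means Female,
# so keeping both columns would add a perfectly collinear feature.
print(list(full.columns))     # -> ['Gender_Female', 'Gender_Male']
print(list(reduced.columns))  # -> ['Gender_Male']
```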
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1,stratify=y, shuffle=True)
X_train.shape, X_test.shape
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3421, 30)
Shape of test set :  (1467, 30)
Percentage of classes in training set:
0    0.811751
1    0.188249
Name: ProdTaken, dtype: float64
Percentage of classes in test set:
0    0.811861
1    0.188139
Name: ProdTaken, dtype: float64
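The near-identical class percentages in train and test above are exactly what stratify=y guarantees. A minimal demonstration on a synthetic imbalanced target (19% positives, mirroring ProdTaken; made-up data, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 19 + [0] * 81)  # 19% positive class

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # both close to 0.19
```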
Both the cases are important as:
Predicting that a customer who actually bought a package would not buy (false negative) will result in loss of revenue.
Predicting that a customer who did not buy a package would buy (false positive) will result in wasteful marketing expense.
f1_score should therefore be maximized: the greater the f1_score, the higher the chances of identifying both the classes correctly.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
'''
model : classifier to predict values of X
'''
# defining an empty list to store train and test results
score_list=[]
#Predicting on train and tests
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
#Accuracy of the model
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
#Recall of the model
train_recall = metrics.recall_score(y_train,pred_train)
test_recall = metrics.recall_score(y_test,pred_test)
#Precision of the model
train_precision = metrics.precision_score(y_train,pred_train)
test_precision = metrics.precision_score(y_test,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
# If the flag is set to True then only the following print statements will be displayed. The default value is set to True.
if flag == True:
print("Accuracy on training set : ",model.score(X_train,y_train))
print("Accuracy on test set : ",model.score(X_test,y_test))
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
print("Precision on training set : ",metrics.precision_score(y_train,pred_train))
print("Precision on test set : ",metrics.precision_score(y_test,pred_test))
return score_list # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
    '''
    model : classifier used to predict on X_test
    y_actual : ground truth labels for the test set
    '''
y_predict = model.predict(X_test)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
# the regression metric helpers below need these sklearn metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
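A quick numeric sanity check of the adjusted R-squared formula with hypothetical values (not model results):

```python
# Hypothetical: R-squared of 0.90 with n = 100 observations and k = 10 predictors
r2, n, k = 0.90, 100, 10
adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(round(adj_r2, 4))  # -> 0.8888, slightly below R², penalizing the extra predictors
```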
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# predicting using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)  # class 1 when the probability exceeds the threshold
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
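The effect of the threshold argument can be seen on a handful of hypothetical predicted probabilities (made-up numbers, not model output): lowering the threshold flags more customers as buyers, trading precision for recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                     # hypothetical ground truth
prob   = np.array([0.1, 0.3, 0.45, 0.6, 0.35, 0.55, 0.7, 0.9])  # hypothetical P(class 1)

for threshold in (0.5, 0.3):
    pred = (prob > threshold).astype(int)
    print(threshold,
          precision_score(y_true, pred),
          recall_score(y_true, pred))
# at 0.5: precision 0.75, recall 0.75; at 0.3: precision drops, recall reaches 1.0
```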
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = (pred_prob > threshold).astype(int)  # class 1 when the probability exceeds the threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion matrix of a classification model built using sklearn
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)

    cm = confusion_matrix(target, y_pred)
    # annotating each cell with the raw count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
## Function to calculate recall score
def get_recall_score(model):
    """
    model: classifier to predict values of X
    """
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set     : ", metrics.recall_score(y_test, pred_test))
# Fitting the training model
d_tree = DecisionTreeClassifier(criterion="gini", random_state=1)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
## Training model performance
dtree_model_train_perf=get_metrics_score(d_tree)
Accuracy on training set :  1.0
Accuracy on test set :  0.8841172460804363
Recall on training set :  1.0
Recall on test set :  0.6811594202898551
Precision on training set :  1.0
Precision on test set :  0.6962962962962963
# Creating the confusion matrix for the test set
confusion_matrix_sklearn(d_tree, X_test, y_test)
dtree_model_train_perf = model_performance_classification_sklearn(d_tree, X_train, y_train)
print("Training performance \n",dtree_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
dtree_model_test_perf = model_performance_classification_sklearn(d_tree, X_test, y_test)
print("Testing performance \n",dtree_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.884117 0.681159 0.696296 0.688645
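The perfect training scores against noticeably lower test scores show the unpruned tree is overfitting. As a minimal sketch of one common remedy (on synthetic data, since the notebook's `X_train`/`y_train` are not reproduced here), restricting `max_depth` trades a little training fit for a simpler, more generalizable tree:

```python
# Sketch: an unrestricted decision tree memorizes the training set, while a
# depth-limited tree gives up that perfect fit in exchange for simpler rules.
# Synthetic data stands in for the notebook's train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

print("full   train/test accuracy:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned train/test accuracy:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

In practice one would tune `max_depth` (or `min_samples_leaf`, or `ccp_alpha`) with cross-validation on the recall metric rather than pick a value by hand.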
feature_names = list(X.columns)
print(feature_names)
['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar', 'NumberOfChildrenVisiting', 'MonthlyIncome', 'lg_DurationOfPitch', 'TypeofContact_Others', 'TypeofContact_Self Enquiry', 'Occupation_Large Business', 'Occupation_Salaried', 'Occupation_Small Business', 'Gender_Male', 'ProductPitched_Deluxe', 'ProductPitched_King', 'ProductPitched_Standard', 'ProductPitched_Super Deluxe', 'MaritalStatus_Married', 'MaritalStatus_Single', 'MaritalStatus_Unmarried', 'Designation_Executive', 'Designation_Manager', 'Designation_Senior Manager', 'Designation_VP']
plt.figure(figsize=(20, 30))
tree.plot_tree(d_tree, feature_names=feature_names, filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
# Text report showing the rules of the decision tree
print(tree.export_text(d_tree, feature_names=feature_names, show_weights=True))
|--- Passport <= 0.50
|   |--- Age <= 21.50
|   |   |--- Occupation_Large Business <= 0.50
|   |   |   |--- ... (deep subtree)
|   |   |--- Occupation_Large Business >  0.50
|   |   |   |--- weights: [0.00, 7.00] class: 1
|   |--- Age >  21.50
|   |   |--- PreferredPropertyStar <= 4.50
|   |   |   |--- ... (deep subtree)
|   |   |--- PreferredPropertyStar >  4.50
|   |   |   |--- ... (deep subtree)
|--- Passport >  0.50
|   |--- Designation_Executive <= 0.50
|   |   |--- CityTier <= 1.50
|   |   |   |--- ... (deep subtree)
|   |   |--- CityTier >  1.50
|   |   |   |--- ... (deep subtree)
|   |--- Designation_Executive >  0.50
|   |   |--- MaritalStatus_Single <= 0.50
|   |   |   |--- ... (deep subtree)
|   |   |--- MaritalStatus_Single >  0.50
|   |   |   |--- ... (deep subtree)

[Rule listing truncated for readability: the unpruned tree continues for many more levels, with most leaves holding only a handful of samples. The root split is on Passport; the no-passport branch splits next on Age (<= 21.50) and the passport branch on Designation_Executive, with deeper splits on MonthlyIncome, DurationOfPitch, PreferredPropertyStar, and marital-status dummies.]
| | |--- CityTier > 2.00 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- OwnCar > 0.50 | | | | | | | |--- Occupation_Small Business <= 0.50 | | | | | | | | |--- NumberOfPersonVisiting <= 3.50 | | | | | | | | | |--- weights: [0.00, 19.00] class: 1 | | | | | | | | |--- NumberOfPersonVisiting > 3.50 | | | | | | | | | |--- Occupation_Large Business <= 0.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- Occupation_Large Business > 0.50 | | | | | | | | | | |--- NumberOfChildrenVisiting <= 2.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | | |--- NumberOfChildrenVisiting > 2.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Occupation_Small Business > 0.50 | | | | | | | | |--- PitchSatisfactionScore <= 3.50 | | | | | | | | | |--- Gender_Male <= 0.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Gender_Male > 0.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- PitchSatisfactionScore > 3.50 | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | |--- MonthlyIncome > 21010.00 | | | | | | |--- lg_DurationOfPitch <= 0.84 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- lg_DurationOfPitch > 0.84 | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | |--- Age > 39.00 | | | | | |--- Occupation_Small Business <= 0.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Occupation_Small Business > 0.50 | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | |--- DurationOfPitch > 13.50 | | | | |--- NumberOfFollowups <= 2.00 | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- NumberOfFollowups > 2.00 | | | | | |--- NumberOfTrips <= 1.50 | | | | | | |--- Occupation_Salaried <= 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Occupation_Salaried > 0.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- NumberOfTrips > 1.50 | | 
| | | | |--- weights: [0.00, 44.00] class: 1
# Importance of each feature in the tree building: computed as the (normalized)
# total reduction of the criterion brought by that feature (the Gini importance).
print(pd.DataFrame(d_tree.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                  Imp
Age                          0.140941
MonthlyIncome                0.115453
NumberOfTrips                0.073109
Designation_Executive        0.072029
DurationOfPitch              0.071054
Passport                     0.059067
PitchSatisfactionScore       0.057296
lg_DurationOfPitch           0.057057
CityTier                     0.055988
NumberOfFollowups            0.046930
MaritalStatus_Single         0.035821
Gender_Male                  0.026093
TypeofContact_Self Enquiry   0.026031
PreferredPropertyStar        0.019970
NumberOfChildrenVisiting     0.018128
NumberOfPersonVisiting       0.017723
MaritalStatus_Unmarried      0.017088
Occupation_Small Business    0.016372
Occupation_Large Business    0.014686
OwnCar                       0.012002
MaritalStatus_Married        0.011139
Occupation_Salaried          0.009150
ProductPitched_Deluxe        0.008883
ProductPitched_Standard      0.005869
Designation_Senior Manager   0.003457
TypeofContact_Others         0.003222
ProductPitched_Super Deluxe  0.002955
Designation_VP               0.001530
Designation_Manager          0.000956
ProductPitched_King          0.000000
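The "total reduction of the criterion" behind these numbers can be reproduced by hand for a single split. Below is a minimal sketch with made-up node counts (not taken from the project data); summing these weighted decreases over every node that splits on a given feature, then normalizing across features, yields the Imp column above.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Weighted impurity reduction achieved by splitting `parent` into
    `left` and `right` -- the quantity accumulated per feature to get
    the Gini importance."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)

# Hypothetical node with 40 non-buyers and 10 buyers, split on some feature:
parent = [40, 10]
left, right = [38, 2], [2, 8]   # made-up child counts
print(round(impurity_decrease(parent, left, right), 4))   # → 0.18
```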
feature_names = X_train.columns  # needed below when labelling the y-axis ticks
importances = d_tree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
According to the decision tree model, Age is the most important variable for predicting whether a customer will purchase a package.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1,20),
'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
'max_leaf_nodes' : [2, 3, 5, 10],
'min_impurity_decrease': [0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations (recall)
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3,
min_impurity_decrease=0.001, random_state=1)
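Under the hood, GridSearchCV simply evaluates every combination in the parameter grid and keeps the best one by the chosen score. A stripped-down sketch of that exhaustive loop, with a toy scoring function standing in for cross-validated recall (no real estimator involved):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive search: evaluate every parameter combination and return
    the best (score, params) pair -- the core of what GridSearchCV does,
    minus cross-validation and refitting."""
    names = list(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Toy objective standing in for cross-validated recall:
toy_score = lambda p: -abs(p["max_depth"] - 3) - 0.1 * p["min_samples_leaf"]
best = grid_search({"max_depth": [1, 2, 3, 5], "min_samples_leaf": [1, 5]}, toy_score)
print(best)   # → (-0.1, {'max_depth': 3, 'min_samples_leaf': 1})
```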
make_confusion_matrix(estimator,y_test)
# Accuracy on train and test
print("Accuracy on training set : ",estimator.score(X_train, y_train))
print("Accuracy on test set : ",estimator.score(X_test, y_test))
# Recall on train and test
get_metrics_score(estimator)
Accuracy on training set :  0.8213972522654195
Accuracy on test set :  0.8411724608043627
Accuracy on training set :  0.8213972522654195
Accuracy on test set :  0.8411724608043627
Recall on training set :  0.3416149068322981
Recall on test set :  0.3695652173913043
Precision on training set :  0.5405405405405406
Precision on test set :  0.6335403726708074
[0.8213972522654195, 0.8411724608043627, 0.3416149068322981, 0.3695652173913043, 0.5405405405405406, 0.6335403726708074]
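All four metrics derive from the four confusion-matrix counts. A sketch with counts chosen to be consistent with the tuned tree's test metrics above (an assumption for illustration, since the raw counts are shown only as a plot):

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual buyers the model finds
    precision = tp / (tp + fp)     # share of flagged customers who actually buy
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Counts consistent with the tuned tree's test-set metrics above:
print([round(m, 3) for m in classification_metrics(tn=1132, fp=59, fn=174, tp=102)])
# → [0.841, 0.37, 0.634, 0.467]
```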
plt.figure(figsize=(10,5))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
Plotting the feature importance of each variable
# Importance of each feature in the tree building: computed as the (normalized)
# total reduction of the criterion brought by that feature (the Gini importance).
print(pd.DataFrame(estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                  Imp
Passport                     0.518003
Designation_Executive        0.481997
Age                          0.000000
Occupation_Salaried          0.000000
Designation_Senior Manager   0.000000
Designation_Manager          0.000000
MaritalStatus_Unmarried      0.000000
MaritalStatus_Single         0.000000
MaritalStatus_Married        0.000000
ProductPitched_Super Deluxe  0.000000
ProductPitched_Standard      0.000000
ProductPitched_King          0.000000
ProductPitched_Deluxe        0.000000
Gender_Male                  0.000000
Occupation_Small Business    0.000000
Occupation_Large Business    0.000000
CityTier                     0.000000
TypeofContact_Self Enquiry   0.000000
TypeofContact_Others         0.000000
lg_DurationOfPitch           0.000000
MonthlyIncome                0.000000
NumberOfChildrenVisiting     0.000000
OwnCar                       0.000000
PitchSatisfactionScore       0.000000
NumberOfTrips                0.000000
PreferredPropertyStar        0.000000
NumberOfFollowups            0.000000
NumberOfPersonVisiting       0.000000
DurationOfPitch              0.000000
Designation_VP               0.000000
feature_names = X_train.columns
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
## Training set performance of unconstrained bagging model
bagging_model_train_perf=model_performance_classification_sklearn(bagging, X_train, y_train)
print("Training performance \n",bagging_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.994154 0.970497 0.998403 0.984252
## Testing set performance of unconstrained bagging model
bagging_model_test_perf=model_performance_classification_sklearn(bagging, X_test, y_test)
print("Testing performance \n",bagging_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.913429 0.612319 0.89418 0.726882
#Creating confusion matrix
make_confusion_matrix(bagging,y_test)
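For intuition, bagging is just bootstrap resampling plus a majority vote over the base learners. A toy sketch using 1-nearest-neighbour base learners on made-up one-dimensional data (not the project features):

```python
import random

def bagged_predict(train, test_point, n_estimators=11, seed=1):
    """Bootstrap aggregation in miniature: each 'estimator' is a
    1-nearest-neighbour classifier fit on a bootstrap resample of the
    training data, and the ensemble predicts by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        sample = [rng.choice(train) for _ in train]             # bootstrap resample
        x, label = min(sample, key=lambda p: abs(p[0] - test_point))
        votes.append(label)
    return max(set(votes), key=votes.count)                     # majority vote

train = [(1, 0), (2, 0), (3, 0), (10, 1), (11, 1), (12, 1)]
print(bagged_predict(train, 10.5))   # votes overwhelmingly follow the nearby label-1 points
```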
Bagging Classifier with weighted decision tree
bagging_wt = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini',class_weight={0:0.17,1:0.83},random_state=1),random_state=1)
bagging_wt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.17,
1: 0.83},
random_state=1),
random_state=1)
## Training set performance of weighted bagging model
bagging_model_train_perf=model_performance_classification_sklearn(bagging_wt, X_train, y_train)
print("Training performance \n",bagging_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.993861 0.970497 0.99681 0.983478
## Testing set performance of weighted bagging model
bagging_model_test_perf=model_performance_classification_sklearn(bagging_wt, X_test, y_test)
print("Testing performance \n",bagging_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.900477 0.536232 0.891566 0.669683
The bagging classifier with a weighted decision tree gives very good accuracy and precision on the training data but does not generalize well to the test data in terms of recall.
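The class weights {0: 0.17, 1: 0.83} used above roughly mirror inverse class frequency: about 18% of customers purchased a package, so each class is weighted by the other class's prevalence. A quick check of that heuristic:

```python
def inverse_frequency_weights(labels):
    """Weight each class by the opposite class's prevalence, so the
    minority class contributes as much total weight as the majority."""
    n = len(labels)
    pos = sum(labels) / n
    return {0: round(pos, 2), 1: round(1 - pos, 2)}

# 18 buyers out of 100 customers, matching the ~18% conversion rate:
labels = [1] * 18 + [0] * 82
print(inverse_frequency_weights(labels))   # → {0: 0.18, 1: 0.82}
```

The {0: 0.17, 1: 0.83} used in the notebook is a nearby rounding of the same idea.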
Tuning Bagging Classifier
Some of the important hyperparameters of the bagging classifier, tuned with a grid search below, are base_estimator, n_estimators, and max_features.
# GRID search for bagging classifier with Hyper Parameters
cl1 = DecisionTreeClassifier(class_weight={0:0.17,1:0.83},random_state=1)
param_grid = {'base_estimator':[cl1],
'n_estimators':[5,7,15,51,101],
'max_features': [0.7,0.8,0.9,1]
}
grid = GridSearchCV(BaggingClassifier(random_state=1,bootstrap=True), param_grid=param_grid, scoring = 'recall')
grid.fit(X_train, y_train)
GridSearchCV(estimator=BaggingClassifier(random_state=1),
param_grid={'base_estimator': [DecisionTreeClassifier(class_weight={0: 0.17,
1: 0.83},
random_state=1)],
'max_features': [0.7, 0.8, 0.9, 1],
'n_estimators': [5, 7, 15, 51, 101]},
scoring='recall')
## getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.17,
1: 0.83},
random_state=1),
max_features=1, n_estimators=51, random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator)
Accuracy on training set :  0.3478515054077755
Accuracy on test set :  0.3265167007498296
Recall on training set :  0.9767080745341615
Recall on test set :  0.9710144927536232
Precision on training set :  0.2210896309314587
Precision on test set :  0.21474358974358973
bagging_wt_model_train_perf=model_performance_classification_sklearn(bagging_estimator,X_train,y_train)
print("Training performance \n",bagging_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.347852 0.976708 0.22109 0.360562
bagging_wt_model_test_perf=model_performance_classification_sklearn(bagging_estimator, X_test, y_test)
print("Testing performance \n",bagging_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.326517 0.971014 0.214744 0.351706
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
#Creating confusion matrix
make_confusion_matrix(rf,y_test)
rf_model_train_perf=model_performance_classification_sklearn(rf,X_train,y_train)
print("Training performance \n",rf_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_model_test_perf=model_performance_classification_sklearn(rf,X_test,y_test)
print("Testing performance \n",rf_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.912747 0.579710 0.930233 0.714286
Random forest with class weights
rf_wt = RandomForestClassifier(class_weight={0:0.17,1:0.83}, random_state=1)
rf_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
make_confusion_matrix(rf_wt,y_test)
rf_wt_model_train_perf=model_performance_classification_sklearn(rf_wt, X_train,y_train)
print("Training performance \n",rf_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_wt_model_test_perf=model_performance_classification_sklearn(rf_wt, X_test,y_test)
print("Testing performance \n",rf_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.911384 0.557971 0.950617 0.703196
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110,251,501],
"min_samples_leaf": np.arange(1, 6,1),
"max_features": [0.7,0.9,'log2','auto'],
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features=0.9, n_estimators=110, random_state=1)
confusion_matrix_sklearn(rf_estimator, X_test,y_test)
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator, X_train,y_train)
print("Training performance \n",rf_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator, X_test, y_test)
print("Testing performance \n",rf_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.929789 0.677536 0.930348 0.784067
# Importance of each feature in the tree building: computed as the (normalized)
# total reduction of the criterion brought by that feature (the Gini importance).
print(pd.DataFrame(rf_estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                  Imp
MonthlyIncome                0.135586
Age                          0.125521
DurationOfPitch              0.068603
lg_DurationOfPitch           0.065719
Passport                     0.064252
NumberOfTrips                0.061156
Designation_Executive        0.057699
NumberOfFollowups            0.053766
PitchSatisfactionScore       0.052807
CityTier                     0.045572
PreferredPropertyStar        0.033873
MaritalStatus_Single         0.027760
NumberOfChildrenVisiting     0.023645
NumberOfPersonVisiting       0.021294
TypeofContact_Self Enquiry   0.019696
Gender_Male                  0.019374
MaritalStatus_Unmarried      0.017918
Occupation_Large Business    0.016937
OwnCar                       0.015173
MaritalStatus_Married        0.013944
Occupation_Small Business    0.013097
Occupation_Salaried          0.012924
Designation_Manager          0.007441
ProductPitched_Deluxe        0.007190
ProductPitched_Standard      0.006217
Designation_Senior Manager   0.005740
ProductPitched_Super Deluxe  0.003480
ProductPitched_King          0.001549
Designation_VP               0.001129
TypeofContact_Others         0.000937
feature_names = X_train.columns
importances = rf_estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
    [dtree_model_train_perf.T,
     bagging_model_train_perf.T, bagging_wt_model_train_perf.T,
     rf_model_train_perf.T, rf_wt_model_train_perf.T,
     rf_estimator_model_train_perf.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Bagging Classifier",
    "Weighted Bagging Classifier",
    "Random Forest Classifier",
    "Weighted Random Forest Classifier",
    "Tuned Random Forest"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
|  | Decision Tree | Bagging Classifier | Weighted Bagging Classifier | Random Forest Classifier | Weighted Random Forest Classifier | Tuned Random Forest |
|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.993861 | 0.347852 | 1.0 | 1.0 | 1.0 |
| Recall | 1.0 | 0.970497 | 0.976708 | 1.0 | 1.0 | 1.0 |
| Precision | 1.0 | 0.996810 | 0.221090 | 1.0 | 1.0 | 1.0 |
| F1 | 1.0 | 0.983478 | 0.360562 | 1.0 | 1.0 | 1.0 |
# Test set performance comparison of all models
models_test_comp_df = pd.concat(
    [dtree_model_test_perf.T,
     bagging_model_test_perf.T, bagging_wt_model_test_perf.T,
     rf_model_test_perf.T, rf_wt_model_test_perf.T,
     rf_estimator_model_test_perf.T],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Bagging Classifier",
    "Weighted Bagging Classifier",
    "Random Forest Classifier",
    "Weighted Random Forest Classifier",
    "Tuned Random Forest"]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
|  | Decision Tree | Bagging Classifier | Weighted Bagging Classifier | Random Forest Classifier | Weighted Random Forest Classifier | Tuned Random Forest |
|---|---|---|---|---|---|---|
| Accuracy | 0.884117 | 0.900477 | 0.326517 | 0.912747 | 0.911384 | 0.929789 |
| Recall | 0.681159 | 0.536232 | 0.971014 | 0.579710 | 0.557971 | 0.677536 |
| Precision | 0.696296 | 0.891566 | 0.214744 | 0.930233 | 0.950617 | 0.930348 |
| F1 | 0.688645 | 0.669683 | 0.351706 | 0.714286 | 0.703196 | 0.784067 |
# Fitting the AdaBoost Classifier model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
#Calculating different metrics
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print(ab_classifier_model_train_perf)
ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test)
print(ab_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(ab_classifier,X_test,y_test)
   Accuracy    Recall  Precision        F1
0  0.843905  0.319876   0.682119  0.435518
   Accuracy    Recall  Precision        F1
0  0.849352  0.326087       0.72  0.448878
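AdaBoost's core loop re-weights the training points after each weak learner: misclassified points gain weight, and each learner's "say" (alpha) depends on its weighted error. A minimal sketch of one round with a SAMME-style update on made-up data:

```python
import math

def adaboost_round(sample_weights, correct, learning_rate=1.0):
    """One AdaBoost round: compute the weak learner's weighted error,
    its vote weight `alpha`, and the re-normalised sample weights with
    misclassified points up-weighted."""
    err = sum(w for w, ok in zip(sample_weights, correct) if not ok)
    alpha = learning_rate * math.log((1 - err) / err)
    new = [w * math.exp(alpha * (0 if ok else 1))
           for w, ok in zip(sample_weights, correct)]
    total = sum(new)
    return alpha, [w / total for w in new]

# Five equally weighted points, one misclassified by the weak learner:
alpha, weights = adaboost_round([0.2] * 5, [True, True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in weights])
# → 1.386 [0.125, 0.125, 0.125, 0.125, 0.5]
```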
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.8, n_estimators=90, random_state=1)
#Calculating different metrics
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print(abc_tuned_model_train_perf)
abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test)
print(abc_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(abc_tuned,X_test,y_test)
   Accuracy    Recall  Precision        F1
0  0.974277  0.889752   0.971186  0.928687
   Accuracy    Recall  Precision       F1
0  0.874574  0.612319   0.686992  0.64751
# importance of features in the tree building
print(pd.DataFrame(abc_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                  Imp
MonthlyIncome                0.292217
Age                          0.153584
DurationOfPitch              0.060687
PitchSatisfactionScore       0.058355
lg_DurationOfPitch           0.050601
NumberOfFollowups            0.050389
NumberOfTrips                0.043332
Passport                     0.033010
CityTier                     0.030130
Gender_Male                  0.027682
PreferredPropertyStar        0.023088
Designation_Executive        0.017324
TypeofContact_Self Enquiry   0.017070
Occupation_Salaried          0.016676
ProductPitched_Super Deluxe  0.016084
MaritalStatus_Single         0.015734
Occupation_Large Business    0.015578
NumberOfChildrenVisiting     0.014447
MaritalStatus_Unmarried      0.010419
ProductPitched_Deluxe        0.010264
Designation_Manager          0.007545
NumberOfPersonVisiting       0.007218
Occupation_Small Business    0.006097
OwnCar                       0.006043
ProductPitched_Standard      0.004672
Designation_Senior Manager   0.003342
MaritalStatus_Married        0.002897
ProductPitched_King          0.002666
TypeofContact_Others         0.001683
Designation_VP               0.001165
feature_names = X_train.columns
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
#Calculating different metrics
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gb_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.885414 0.444099 0.89375 0.593361
Testing performance:
Accuracy Recall Precision F1
0 0.867757 0.394928 0.801471 0.529126
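Gradient boosting with squared error fits each new learner to the current residuals and adds a shrunken copy of it to the ensemble. A self-contained one-dimensional sketch with decision stumps (illustrative data, not the project's):

```python
def fit_stump(xs, residuals):
    """Simplest stump: split at the median x and predict the mean
    residual on each side."""
    thr = sorted(xs)[len(xs) // 2]
    left = [r for x, r in zip(xs, residuals) if x < thr]
    right = [r for x, r in zip(xs, residuals) if x >= thr]
    lmean = sum(left) / len(left) if left else 0.0
    rmean = sum(right) / len(right) if right else 0.0
    return lambda x: lmean if x < thr else rmean

def gradient_boost(xs, ys, n_rounds=100, lr=0.1):
    """Squared-error gradient boosting: repeatedly fit a stump to the
    current residuals and add a learning-rate-shrunken copy of its
    predictions to the ensemble."""
    pred = [0.0] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for x, p in zip(xs, pred)]
    return pred

xs, ys = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 9.0, 9.0]
print([round(p, 2) for p in gradient_boost(xs, ys)])   # → [1.0, 1.0, 9.0, 9.0]
```

The predictions converge geometrically to the targets; shrinking each step by the learning rate is what trades training-set fit for generalization.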
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=250, random_state=1,
subsample=0.9)
#Calculating different metrics
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.924876 0.63354 0.951049 0.760485
Testing performance:
Accuracy Recall Precision F1
0 0.884117 0.496377 0.815476 0.617117
# Importance of each feature in the tree building: computed as the (normalized)
# total reduction of the criterion brought by that feature (the Gini importance).
print(pd.DataFrame(gbc_tuned.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                  Imp
MonthlyIncome                0.187546
Age                          0.128734
Passport                     0.118716
Designation_Executive        0.085603
CityTier                     0.057971
lg_DurationOfPitch           0.054656
NumberOfFollowups            0.053271
MaritalStatus_Single         0.051093
NumberOfTrips                0.045507
DurationOfPitch              0.041019
PreferredPropertyStar        0.032864
PitchSatisfactionScore       0.031541
MaritalStatus_Unmarried      0.024992
TypeofContact_Self Enquiry   0.011849
Occupation_Large Business    0.011068
Designation_Senior Manager   0.010040
NumberOfPersonVisiting       0.007715
Gender_Male                  0.006499
ProductPitched_Super Deluxe  0.005831
Designation_Manager          0.005786
NumberOfChildrenVisiting     0.005668
ProductPitched_Standard      0.005175
ProductPitched_Deluxe        0.003412
Occupation_Small Business    0.002943
OwnCar                       0.002925
Occupation_Salaried          0.002660
MaritalStatus_Married        0.001608
ProductPitched_King          0.001291
Designation_VP               0.001144
TypeofContact_Others         0.000875
feature_names = X_train.columns
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)
#Calculating different metrics
xgb_classifier_model_train_perf=model_performance_classification_sklearn(xgb_classifier,X_train,y_train)
print("Training performance:\n",xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf=model_performance_classification_sklearn(xgb_classifier,X_test,y_test)
print("Testing performance:\n",xgb_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.999415 0.996894 1.0 0.998445
Testing performance:
Accuracy Recall Precision F1
0 0.925017 0.684783 0.891509 0.77459
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')
# Grid of parameters to choose from
parameters = {
"n_estimators": [10,30,50],
"scale_pos_weight":[1,2,5],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
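The scale_pos_weight values in the grid can be anchored to the class balance: a common heuristic is the ratio of negative to positive samples, which for the ~18% purchase rate noted in the problem statement gives about 4.6, close to the 5 the search ends up selecting:

```python
def suggested_scale_pos_weight(labels):
    """Common heuristic for XGBoost's scale_pos_weight:
    (number of negative samples) / (number of positive samples)."""
    pos = sum(labels)
    return (len(labels) - pos) / pos

labels = [1] * 18 + [0] * 82          # mirrors the ~18% purchase rate
print(round(suggested_scale_pos_weight(labels), 2))   # → 4.56
```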
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.2, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50, n_jobs=4,
num_parallel_tree=1, predictor='auto', random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=5, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
#Calculating different metrics
xgb_tuned_model_train_perf=model_performance_classification_sklearn(xgb_tuned,X_train,y_train)
print("Training performance:\n",xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf=model_performance_classification_sklearn(xgb_tuned,X_test,y_test)
print("Testing performance:\n",xgb_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.974277 0.995342 0.88292 0.935766
Testing performance:
Accuracy Recall Precision F1
0 0.899796 0.815217 0.700935 0.753769
# Importance of each feature in the tree building: computed as the (normalized)
# total reduction of the criterion brought by that feature (the Gini importance).
print(pd.DataFrame(xgb_tuned.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                  Imp
Passport                     0.139847
Designation_Executive        0.094679
ProductPitched_Super Deluxe  0.056133
MaritalStatus_Single         0.053401
CityTier                     0.052300
MaritalStatus_Married        0.049967
MaritalStatus_Unmarried      0.046874
PreferredPropertyStar        0.035065
ProductPitched_Standard      0.033105
Occupation_Large Business    0.032916
Age                          0.032502
NumberOfTrips                0.032291
NumberOfFollowups            0.031257
DurationOfPitch              0.030926
Occupation_Small Business    0.029915
ProductPitched_Deluxe        0.028690
PitchSatisfactionScore       0.028423
Occupation_Salaried          0.026825
MonthlyIncome                0.026550
ProductPitched_King          0.025864
Gender_Male                  0.024496
TypeofContact_Self Enquiry   0.022416
OwnCar                       0.018986
NumberOfChildrenVisiting     0.015875
TypeofContact_Others         0.015508
NumberOfPersonVisiting       0.015189
lg_DurationOfPitch           0.000000
Designation_Manager          0.000000
Designation_Senior Manager   0.000000
Designation_VP               0.000000
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Now, let's build a stacking model using the tuned random forest and tuned gradient boosting models along with a decision tree as base estimators, and the tuned XGBoost model as the final estimator for the meta-level prediction.
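At prediction time, stacking feeds the base models' outputs into the meta-model as features. A minimal sketch with hypothetical base-model probabilities and made-up meta-model weights (here the meta-model is a fixed logistic function; the real final estimator below is the tuned XGBoost):

```python
import math

def stacked_prediction(base_probs, meta_weights, bias=0.0):
    """The essence of stacking: base-model probabilities become the
    feature vector for a meta-model -- here a logistic regression
    with fixed, made-up coefficients."""
    z = bias + sum(w * p for w, p in zip(meta_weights, base_probs))
    return 1 / (1 + math.exp(-z))

# Hypothetical probabilities from three base models for one customer,
# combined with made-up meta-model weights:
probs = [0.8, 0.6, 0.9]
weights = [2.0, 1.0, 2.0]
print(round(stacked_prediction(probs, weights, bias=-2.0), 3))   # → 0.881
```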
estimators = [('Random Forest',rf_estimator), ('Gradient Boosting',gbc_tuned), ('Decision Tree',d_tree)]
final_estimator = xgb_tuned
stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(max_features=0.9,
n_estimators=110,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9,
n_estimators=250,
random_state=1,
subsample=0.9)),
('Decision Tree',
DecisionTreeClassifier(random_state=1))],
final_estimator=XGBC...
gpu_id=-1,
importance_type=None,
interaction_constraints='',
learning_rate=0.2,
max_delta_step=0, max_depth=6,
min_child_weight=1,
missing=nan,
monotone_constraints='()',
n_estimators=50, n_jobs=4,
num_parallel_tree=1,
predictor='auto',
random_state=1, reg_alpha=0,
reg_lambda=1,
scale_pos_weight=5,
subsample=1,
tree_method='exact',
validate_parameters=1,
verbosity=None))
#Calculating different metrics
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.994154 1.0 0.96988 0.984709
Testing performance:
Accuracy Recall Precision F1
0 0.907294 0.880435 0.702312 0.78135
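The metric tables above come from the `model_performance_classification_sklearn` helper defined earlier in the notebook. For readers working through this section in isolation, a minimal sketch of what such a helper might look like (the exact internals here are an assumption, not the notebook's definition):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)


def model_performance_classification_sklearn(model, predictors, target):
    """Return a one-row DataFrame of Accuracy, Recall, Precision, and F1
    for a fitted classifier, matching the tables printed above."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )
```

Returning a one-row DataFrame (rather than a dict) is what makes the later `pd.concat([...T, ...T], axis=1)` comparison tables straightforward.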
# training performance comparison
models_train_comp_df = pd.concat(
[dtree_model_train_perf.T, rf_model_train_perf.T,rf_wt_model_train_perf.T,
ab_classifier_model_train_perf.T,abc_tuned_model_train_perf.T,gb_classifier_model_train_perf.T,gbc_tuned_model_train_perf.T,
xgb_classifier_model_train_perf.T,xgb_tuned_model_train_perf.T,stacking_classifier_model_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "AdaBoost Classifier",
    "AdaBoost Tuned",
    "Gradient Boost Estimator",
    "Gradient Boost Tuned",
    "XGB Classifier",
    "XGB Tuned",
    "Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree | Random Forest Estimator | Random Forest Tuned | AdaBoost Classifier | AdaBoost Tuned | Gradient Boost Estimator | Gradient Boost Tuned | XGB Classifier | XGB Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 1.0 | 0.843905 | 0.974277 | 0.885414 | 0.924876 | 0.999415 | 0.974277 | 0.994154 |
| Recall | 1.0 | 1.0 | 1.0 | 0.319876 | 0.889752 | 0.444099 | 0.633540 | 0.996894 | 0.995342 | 1.000000 |
| Precision | 1.0 | 1.0 | 1.0 | 0.682119 | 0.971186 | 0.893750 | 0.951049 | 1.000000 | 0.882920 | 0.969880 |
| F1 | 1.0 | 1.0 | 1.0 | 0.435518 | 0.928687 | 0.593361 | 0.760485 | 0.998445 | 0.935766 | 0.984709 |
# Testing performance comparison
models_test_comp_df = pd.concat(
[dtree_model_test_perf.T, rf_model_test_perf.T,rf_wt_model_test_perf.T,
ab_classifier_model_test_perf.T,abc_tuned_model_test_perf.T,gb_classifier_model_test_perf.T,gbc_tuned_model_test_perf.T,
xgb_classifier_model_test_perf.T,xgb_tuned_model_test_perf.T,stacking_classifier_model_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "AdaBoost Classifier",
    "AdaBoost Tuned",
    "Gradient Boost Estimator",
    "Gradient Boost Tuned",
    "XGB Classifier",
    "XGB Tuned",
    "Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree | Random Forest Estimator | Random Forest Tuned | AdaBoost Classifier | AdaBoost Tuned | Gradient Boost Estimator | Gradient Boost Tuned | XGB Classifier | XGB Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.884117 | 0.912747 | 0.911384 | 0.849352 | 0.874574 | 0.867757 | 0.884117 | 0.925017 | 0.899796 | 0.907294 |
| Recall | 0.681159 | 0.579710 | 0.557971 | 0.326087 | 0.612319 | 0.394928 | 0.496377 | 0.684783 | 0.815217 | 0.880435 |
| Precision | 0.696296 | 0.930233 | 0.950617 | 0.720000 | 0.686992 | 0.801471 | 0.815476 | 0.891509 | 0.700935 | 0.702312 |
| F1 | 0.688645 | 0.714286 | 0.703196 | 0.448878 | 0.647510 | 0.529126 | 0.617117 | 0.774590 | 0.753769 | 0.781350 |
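Given the comparison table, the "best" model depends on the metric the business cares about; since buyers are the rare class and missing one wastes marketing spend, Recall is a natural selection criterion. A small sketch of metric-driven selection, using a hypothetical two-column subset of `models_test_comp_df` in place of the full table:

```python
import pandas as pd

# Hypothetical subset of the test-comparison table above; in the notebook
# this is the full models_test_comp_df.
models_test_comp_df = pd.DataFrame(
    {
        "XGB Tuned": [0.899796, 0.815217, 0.700935, 0.753769],
        "Stacking Classifier": [0.907294, 0.880435, 0.702312, 0.781350],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# idxmax on a metric row returns the column (model) with the highest score
best_by_recall = models_test_comp_df.loc["Recall"].idxmax()
print(best_by_recall)  # -> Stacking Classifier
```

Swapping `"Recall"` for `"F1"` or `"Precision"` reruns the selection under a different business priority.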
# Plot the feature importances of the (untuned) XGBoost classifier,
# for comparison with the tuned model's importances shown earlier
feature_names = X_train.columns
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()